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Abstract 

^^ We apply nonparamctric Bayesian methods to study the problem of estimating 

I— I the intensity function of an inhomogeneous Poisson process. We exhibit a prior on 

^H intensities which both leads to a computationally feasible method and enjoys desirable 

^0 theoretical optimality properties. The prior we use is based on B-spline expansions 

(-H with free knots, adapted from well-established methods used in regression, for in- 

stance. We illustrate its practical use in the Poisson process setting by analysing 
count data coming from a call centre. Theoretically we derive a new general theo- 
rem on contraction rates for posteriors in the setting of intensity function estimation. 
Practical choices that have to be made in the construction of our concrete prior, such 
as choosing the priors on the number and the locations of the spline knots, are based 
^ on these theoretical findings. The results assert that when properly constructed, our 

approach yields a rate-optimal procedure that automatically adapts to the regularity 
(^—^ of the unknown intensity function. 

• Keywords: Adaptive estimation; Bayesian nonparametric estimation; Contraction 

(^ rate; Markov chain Monte Carlo; Poisson process; Splines. 



1 Introduction 



/^ Poisson processes have a long-standing history and are some of the most widely used 

j^ processes in statistics to study temporal and spacial count data, in diverse fields such as 

communication, meteorology, seismology, hydrology, astronomy, biology, medicine, actu- 
ary sciences and queueing, among others. In this paper we focus on inhomogenous Poisson 
processes on the line with periodic intensity functions, which are models for count data in 
settings with a natural periodicity. 

Nonparametric Bayesian methods, which are used more and more in many different 
statistical settings, have so far only been used on a limited scale to analyze such models. 
From the applied perspective they can be attractive for making inference about intensity 



functions, for the same reasons they are appeahng in other situations. Estimating the in- 
tensity essentiaUy requires some sort of smoothing of the count data, and a nonparametric 
Bayesian approach can provide a natural way for achieving this. Using hierarchical priors 
we can automatically achieve a data-driven selection of the degree of smoothing. Moreover, 
Bayesian methods provide a way to quantify the uncertainty about the intensity using the 
spread of the posterior distribution. A typical implementation provides a computational 
algorithm that can generate a large number of (approximate) draws from the posterior. 
From this it is usually straightforward to construct numerical credible bands or sets. 

The relatively small number of papers using nonparametric Bayesian methodology for 
intensity function smoothing have explored various possible prior distributions on inten- 



sities. An early reference is M0ller et al. (1998), who consider log-Gaussian priors. Other 



papers employing Gaussian process priors, combined with suitable link functions, include 



Adams et al. (2009) and Palacios k Minin (2013). Kottas & Sanso (2007) consider kernel 



mixtures priors. See also the related paper DiMatteo et al. (2001), in which count data is 
analysed using spline-based priors. 

The cited papers show that nonparametric Bayesian inference for inhomogenous Pois- 
son processes can give satisfactory results in various applications. On the theoretical side 
however the existing literature provides no performance guarantees in the form of con- 
sistency theorems or related results. It is by now well known that nonparametric Bayes 
methods may suffer from inconsistency, even when seemingly reasonable priors are used 



e.g. Diaconis & Freedman (1986)). The purpose of this paper is therefore to exhibit a 



Bayesian approach to nonparametric intensity smoothing that is both computationally 
feasible and at the same time theoretically underpinned by results on consistency and re- 
lated issues like convergence rates and adaptation to smoothness. Such theoretical results 
have in the last decade been obtained for various statistical settings, including density es- 



timation, regression, classification, drift estimation for diffusions, etcetera (see e.g. Ghosal 



(2010) for an overview of some of these results). Until now, intensity estimation for inho- 



mogenous Poisson processes has remained largely unexplored. 

As motivation and starting point for the paper we consider the problem of analysing 
count data from a call center. The same type of data were analyzed by frequentist methods 



in the paper Brown et al. (2005). We revisit the problem using a nonparametric Bayesian 



method employing a spline-based prior on the unknown intensity function. In addition to 
a single estimator of the intensity, this method provides credible bounds indicating the 
degree of uncertainty. In Section[3]we study theoretical properties of our procedure, namely 
consistency, posterior contraction rates and adaptation to smoothness. The results show 
that we have set up our procedure in such a way that we obtain consistent, rate-optimal 
estimation of the intensity and that the method adapts automatically to the unknown 
smoothness of the intensity curve, up to the level of the order of the splines that are used. 
Section [4] concludes with some remarks and directions for further research. 



2 Analysis of call center data 



2.1 Data and statistical model 

The approach we propose and study is motivated by the wish to analyse a dataset con- 
sisting of counts of telephone calls arriving at a certain call center. The dataset was 



obtained from the website of the S.E.E. Center (http://ie.technion.ac.il/Labs/Serveng/) 



of the Faculty of Industrial Engineering and Management, Technion in Haifa, Israel. It 
consists of counts for calls arriving at a bank's 24 hour a day call center in the United 
States of America. We considered the records for the period from November 1, 2001 until 
December 31, 2001, covering a total of about 2.8 million incoming phone calls. These 
events are recorded in 30 second intervals with an average of about 32 calls per minute. 
The raw data are plotted in Figure [T} 



Number of calls since November 1 , 2001 




Figure 1: Number of incoming phone calls between November 1, 2001 and December 31, 
2001. 

We model the full count data as the realization of an inhomogenous Poisson process 



A'^ with an intensity function A that is periodic, the period being 24 hours (Daley &; 



m 



Vere-Jones (1988)). This Poisson assumption is natural and is investigated in some detail 



Brown et al. (2005|), who could not find significant evidence to the contrary in a similar 



dataset (same kind of data, but over a different time period). See also Belitser et al. 



(2012), who study the periodicity in the data. 



Let n be the number of days for which we have data (n = 61) and let T be the period 
(24 hours). Then the full call arrival counting process is given by A^ = (Nt : t G [0, nT]), 
where Nt is the number of calls arriving in the time interval [0, t]. The Poisson assumption 
means that for every < s < t, the number of arrivals Nf — Ng is independent of the history 
(A„ : u < s) up till time s and that is has a Poisson distribution with mean 



A(n) du. 



We will assume throughout that A is at least a continuous function. The periodicity 
assumption then means that A is a T-periodic function, i.e. A(t + T) = A(t) for all t > 0. 
For z = 1, . . . , n we define the counting process N^^' = [N^ : t £ [0, T]) by 



m 



(i) 



N{i~l)T+t 



^{i-l)T, 



tG[0,T], 



i.e. N^"^' counts the number of arrivals during day i. Note that by the independence of the 
increments of the process N, the processes A^'*' are independent inhomogenous Poisson 
processes which have the restriction of A to [0, T] as intensity function. 

Our goal is to make inference about this function. Note that we do not observe the 
full process N. We only observe it at discrete times, namely every 30 seconds. On average 
about 16 calls arrive in a 30 second time interval, so we really only see aggregated counts. 
Let A be the time between observations (30 seconds) and let m = T/A be the number 
of counts per day that we have in our dataset (m = 2880 in our case). Then for every 
i = 1, . . . ,n and j = 1, . . . ,m, the number of arrivals 



in the jth time interval on day i has a Poisson distribution with parameter 

Xj = / A(t) dt. 

J(j-1)A 

We denote the total available count data by C" = {Cij : i = 1, . . . ,n,j = 1, . , 
follows that the likelihood is given by 



(1) 

(2) 
,m). It 



A 'e~^^ 
In the following section we describe the prior we place on the intensity function A. 



(3) 



2.2 Prior on the intensity function 

There are different possible choices of priors for the function A. A number of earlier 
considered options were already mentioned in the introduction (Gaussian processes, kernel 
mixtures, splines). The prior we investigate in this paper is motivated by the desire to have 
a computationally manageable procedure on the one hand and theoretical performance 
guarantees on the other. Still there will conceivably be more than one sensible choice 
meeting these requirements. Here we restrict ourselves to the investigation of a specific 
spline-based prior which is described in detail in this section. 

More precisely we will employ a certain free-knot spline prior which is similar to priors 



considered earlier in different contexts (see for instance Smith & Kohn (1996), Denison 



et al. (1998), DiMatteo et al. (2001), or, more recently Sharef et al. (2010) and the ref- 



erences therein). Such priors have proven to be numerically attractive and capable of 



capturing abrupt changes in functions of interest. This last point is relevant for our par- 
ticular application, since we expect fluctuations during the day due to the varying activity 
of businesses over the day. Recently, several theoretical results were derived for spline- 
based priors in various setting as well 



(e.g. 



De Jonge & van Zanten (2012), Belitser & Serra (2013)). We will show in the next sec- 



Ghosal et al. (2000), Ghosal et al. (2008), 



tion that the procedure that we construct and implement has several desirable theoretical 
properties. 

Background information on splines can be found, for example, in de Boor (2001) or 
Schumaker (2007). Let us fix some notations and terminology. A function is called a 
spline of order q £ N, with respect to a certain partition of its support, if it is g — 2 
times continuously differentiable and when restricted to each interval in the partition, it 
coincides with a polynomial of degree at most q — 1- Now consider q >2. For any j G N, 



such that j > q let ICj 
refer to a vector k £ fC 



--{{h,...,kj.,) e {0,Ty-i : < ki < 
■J as a sequence of inner knots. 
A vector k G ICj induces the partition {[/cq, ^i), [^i, ^2), • • • 



< kj^q < 1}- We will 



\ki 



with kn = and ki 



T. For k G /C, 



_,+i]} of [o,r], 

Sg the linear space of 



-q, l^y 



vQ = u anu Kj-q-\-i = 1 . ror n t /v-j, we denote by S w 

splines of order q on [0, T] with simple knots k (see the definition of simple knots in. 

This space has dimension j and admits a basis of B-splines 

,k. 



e.g., 



. . . , kj. 
and T 
k 



Schumaker] p007| )). 

. . . ,Bj}. The construction of {Sf , 



, Bj} involves the knots k^ 



q+li ■ 



-q, l^j 



_g+i, kj-q-^-2, ... ,kj, with arbitrary extra knots A;_g+i 



kj-q+l < kj-q+2 < 



■ < kj. Usually one takes A;_g+i 



kj, and we adopt this choice as well. For k £ IC 



< ... 

• = k- 
and 6 E 



,ko,ki, 
< k^i < ko = 
-1 = ko = and 



V we denote 



by sg^fc the spline in S^ that has coefficient vector relative to the basis {-Bf , . . . , -B^}, 



I.e. 



se.kit) 



i=l 



To define our prior 11 on A we first fix the order g' > 2 of the splines that we use (cubic 
splines are popular, they correspond to the choice g = 4) and the minimum and maximum 
intensities < Mi < M2. Then a draw from the prior 11 is constructed as follows: 

1. (Number of B-splines): Draw J > q according to a shifted Poisson distribution with 
mean fi. 

2. (Location of the knots): Given J = j, construct a regular 1/j^-spaced grid in (0, T). 
Then uniformly at random, choose j — q grid elements (without replacement) to form 
a sequence of inner knots k. 

3. (B-spline coefficients): Also given J = j, and independent of the previous step, draw 
a vector 6 of j independent, uniform C/[Mi, M2]-distributed B-spline coefficients. 

4. (Random spline): Finally, construct the random spline sg^k of order q corresponding 
to the inner knots k and with B-spline coefficient vector 0. 

The specific choices made in the construction of the prior, like the Poisson distribution 
on J, choosing the knots uniformly at random from a grid, etcetera, are motivated by 



the optimality theory that we derive in Section [3| The theory shows that there is some 
more flexibihty, but for choices too far from the ones proposed above the performance 
guarantees brake down. Technically the prior on A is the measure 11 on the space C[0,T] 
of continuous functions on [0, T] given by the law, or distribution of the random spline 
se^k described above. The splines in Sg are q — 2 times continuously differentiable, hence 
in this sense the choice of q determines the regularity of the prior. We will see in the next 
section that it also determines the maximal degree of smoothness of the true underlying 
intensity to which our procedure can adapt. In applications like the one we are interested 
in here, a sensible choice of the parameters Mi and M2 will typically be suggested by the 
average number of counts per time unit in the data. In Section |4] we comment on other 
possibilities, such as putting a prior on these parameters as well, in order to let their value 
be determined by the data automatically. This is possible, but will come at an additional 
computational cost. The construction of the grid in step 2. is non-standard compared to 
other spline-based priors proposed in the literature. It is motivated by recent work of 



Belitser & Serra (2013) and will allow us to derive desirable theoretical properties in the 



next section. 



2.3 Posterior inference 

For the data described in Section 2.1 , with likelihood ([3]), and the spline prior 11 described 



in Section 2.2 we implemented an MCMC procedure to sample from the corresponding 
posterior distribution of the intensity function A of interest. The minimal and maximal 
intensity parameters Mi and M2 were set to 200 and 20000, respectively. These numbers 
were motivated by the range of the data (time is measured in hours). We took the order 
q of the splines equal to 4. 



Since our prior is very similar to the ones used previously in for instance DiMatteo 



et al. (2001) or Sharef et al. (2010) in regression or hazard rate estimation settings, our 
computational algorithm is a rather straightforward adaptation of existing methods. A 
generic state of the chain is a (2 J — q + 1) -dimensional vector (j, k, 0) where j GN, j > q 
is the model index, k = kj £ (0, Ty~'^ is a vector of inner knots and 6 = 6j £ W is a 
vector of B-spline coordinates. Together, these index a spline se k = si , £ Si . We 
will abbreviate the corresponding posterior distribution by 7r(j, fc, 0|C"). Since the splines 
involved are easy to evaluate and integrate we can compute the likelihood, and then the 
posterior, up to the normalization constant, without any approximations being needed. 

We consider four different types of moves for the MCMC chain, namely: a) perturbing 
the coefficients, b) moving the location of one knot, c) birth of a new knot and d) death of 
an existing knot. Each of these moves is proposed, independently and respectively, with 
probabilities pa, Pb, vAi) and pd(j) where for each j > g, pa + Pb + Pc{j) + Pd{j) = 1- 
In fact, we start by picking < pa + Pfe < 1 as parameters of the algorithm; if ^ is the 
mean of the prior on J, then we take Pc{q) = ^ — Pa — Pb, Pd{q) = and, for j > q, 

j — q j — g 

Pc{j) = (1 — Pa — Pb)2 ''"« and Pd{i) = (1 — Pa — Pb){^ — 2 f"?). This choice results in 
Pcip) = Pdif^) if i = /^, Pc(^) > PdifJ-) if J < fJ-, PcifJ') > Pdif^) if J < At- 

When perturbing the coefficients we perform simple (Gaussian) random walk MCMC 
steps; the standard deviation of the random walk was chosen such that we obtained an 



acceptance rate of roughly 23% for this type of move, as prescribed in Gelman et al. 



(1997). Let ipj be the joint density of j i.i.d. standard normal random variables. Our 
proposals correspond to a move (j, k, 9) — )• (j, k, 6 + au) which we accept with probability 
min (A(n), l), with 

TT{j,k,e + au\C'')paipj{-au) _ TT{j,k,6 + au\C'') 
A{U) — 



TT{j,k,9\C^)pafj{au) Tr{j,k,0\C^) 

Moving a knot is also straightforward; one of the current j — q knots, say ki, is picked 
uniformly at random among those in k, and we propose to change its location depending 
on how many of its neighboring position on the j~^-spaced grid are free - we say that two 
knots k, k' are neighbors if \k — k'\ < j~'^. This means that we propose a move (j, k, 6) — )• 
{j,k',9) where k and k' differ only at the i-th position: if ki has two free neighboring 
positions, then it moves to either of them with equal probability c, = Cj(A:j_i, ki, fej+i) = 
1/2; if ki only has one free neighboring position, then, with equal probability c, = 1/2, it 
either moves to this free position or it does not move at all; if ki has no free neighboring 
positions then if does not move, with probability q = 1. These particular choices assure 
the reversibility of the moves. We accept such a proposal with probability min {A{i), l) 
where A{i) is given by 

^(i) _ ^(J, (^1, . • • , fc^ . . . , k,_g), e\C^) p, {j - (?)-! c, _ 7r(j, k', 0\C^) 



7r(j, {ki, . . . , fci, . . . , kj.g), 6>|C") pb {j - q)-^ c, TT{j, k , e\C^) ■ 

Birth moves and death moves, where a new knot is respectively added and removed, 
are reverse moves of one another and so we will outline only how to perform the birth 
move. We propose a move {j,k,0) — )• (j + l,k',0') where we add a new knot to the 
vector k and a new coefficient to the vector 0. In doing so, a new B-spline is introduced 
to the B-spline basis and a new B-spline coefficient is generated. The new knot vector 
k' contains all knots from k rounded to the closest grid point on a (j + 1)~^ spaced grid 
with the extra knot then picked uniformly at random among the remaining free positions; 
call it ki^i < k' < ki. Note that this construction does not prevent two knots in k' 
from occupying the same position; such knot vectors have posterior probability 0, though, 
so that the probability of moving to such a state is zero. The coefficients on this basis 
are then picked as 6' = f{0, u) = (6*1, ... , Om-i-, u, 6m, ■ ■ ■ , dj) where / will be linear and 
invertible, and u is a random seed, a normally distributed random number with mean 
T]{0), to be picked later, and variance 1. The new knot will belong to the support of 
q B-splines, namely the i-th through {i + q — l)-th B-splines and we pick the index m 
in {i, . . . ,i + q} depending on the knot's position within the interval [ki-i, ki]; namely 
m = i + [{q + l){k' — ki^i)/{ki — ki-i)\, where [a\ is the largest integer smaller or equal 
to a. The mean of the random seed u will be picked as a weighted mean of the coefficients 
6, namely, rj{6) = YIT^ "^i^i + Z]i=m.^*-i^«' where the weights Wi are normalized and 

Wi^ [ Bl,Jt)Bl,.{t)dt, i = l,...,j + l. 

Jo •'' J' 



With probability min (^A{j , k' , u) , 1) we make the move {j,k,6) — )• (j + l,k',6'), with 
6' = f{0,u) and k' = (ki, . . . , ki-i, k' , ki, . . . , kj-q), where 



A{j,k',u) 



TTJj + l,k',f{e,u)\C^) {I - p, - p,)pa{j + 1) (j - g + 1) 
TT{j, k, e\C^) (I-Pa- Pb)PciJ) {P -j + q)-^Mu) 



\J 



f\ 



where \Jf\ is the Jacobian of the hnear mapping described before. 

Figure [2] summarizes the outcome of the analysis. In the top panel it shows the 
posterior and 95% point-wise credible intervals, based on 10,000 samples from posterior. 
The lower panel shows a histogram for the locations of the knots corresponding to the 
samples from the chain used to generate the top panel. Note that as expected, relatively 
many knots are placed in periods in which there are relatively many fluctuations in the 
intensity. 
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Figure 2: Top panel: posterior distribution of the intensity function A based on the 
thinned data. (Blue: posterior mean, red: point-wise 95% credible intervals). Lower 
panel: posterior distribution of the knot locations (Histogram). 



Due to the large amount of data (almost 3 million counts), the credible bands are very 
narrow. To illustrate the dependence on the amount of data we ran the analysis again 
with a thinned out dataset. We randomly removed counts, retaining about 1,000 counts. 
The same analysis then leads to the posterior plot given in Figure [3j In this case, the 
uncertainty in the posterior distribution becomes clearly visible. 



We find that the prior that we defined in Section 2.2 is a computationally feasible choice 
for nonparametric Bayesian intensity smoothing in the context of this kind of periodic 
count data. In the next section we analyze its fundamental theoretical performance. See 
in particular Theorem [3] in Section [3. 2[ 
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Figure 3: Top panel: posterior distribution of the intensity function A based on the 
thinned data. (Blue: posterior mean, red: point-wise 95% credible intervals). Lower 
panel: posterior distribution of the knot locations (Histogram). 



3 Theoretical results 

3.1 Contraction rates for general priors 

We derive our theoretical results for the particular prior we used in the Section [2] from 
general rate of contraction results that we present in this section. These are in the spirit 
of the general theorems about convergence rates of nonparametric Bayes procedures that 



were first developed for density estimation (Ghosal et al. (2000)) and later for various other 



statistical settings, see for instance van der Meulen et al. (2006), Ghosal & van der Vaart 



(2007), Panzar & van Zanten (2009). Here we complement this literature with general rate 



results regarding intensity estimation for inhomogenous Poisson processes. These results 
are not only applicable to the spline priors we consider in this paper, but may also be 
used to analyze contraction rates of other priors. Moreover, we formulate the theorems 
not just for the case that we have discrete observations of aggregated data, as in our data 
example, but also for the case that the full counting process is observed. 

The setting is as in Section 2.1 We fix a period T > 0. In the full observations case 
we assume that for n G N, we observe an inhomogeneous Poisson process N"^ = [N^ : 
t G [0,nT]) up till time riT, with a T-periodic intensity function A. Equivalently, we can 
say we observe n independent inhomogeneous Poisson processes N^^\ . . . , A^'"-*, indexed 
by [0,r], and with a common intensity function A, which is a positive, integrable function 
on [0,r]. It is well known that the law of N under the intensity function A is equivalent 
to the law of a standard Poisson process and that the corresponding likelihood is given by 



p(iV" I A) = e-/o"^(^W-i)"'*+/o"^i°s(^W)'^^t'' 



(see for instance Jacod & Shiryaev (2003)). We consider prior distributions that charge 
strictly positive, continuous functions. Given such a prior n„ on A (which we allow to 
depend on n) we can then compute the corresponding posterior distribution n„(- | A^") 
by Bayes' formula 

Formally we can view the prior n„ as a measure on the space A C C[0, T] of all continuous, 
strictly positive functions on [0,r], endowed with its Borel cr-field. If we endow A with 
the uniform norm, the likelihood is a continuous function on A. Hence, the posterior is a 
well-defined measure on A. 

The following theorem considers the frequentist setting in which the data are assumed 
to be generated by an unknown, "true" intensity function Aq. It gives conditions on the 
prior n„ under which the posterior Iin{- \ A^") contracts around the true Aq at a certain 
rate as the number of observed periods tends to infinity. The assumptions and conclusions 
of the theorem are formulated in terms of various distances on the intensity functions. For 
a continuous function / on [0, T] we define the norms ||/||2 and ||/||oo as usual by 

\\f\\i= r f{t)dt, ii/iioo= sup i/(t)i. 

Jo te[o,T] 

For a set of positive continuous functions J- we write J-'^ for its complement and v^ = 
{■^Z/ : / € J-}. For e > and a norm || • || on J^, let N(e,J-, \\ ■ ||) be the minimal number 
of balls of II • 1 1 -radius e needed to cover J^. 

Theorem 1 (Contraction rate for full observations). Assume that Aq is bounded away 
from 0. Suppose that for positive sequences en,£n — ^ such that n{en A e„)^ —)• oo as 
n — )• oo, and constants ci,C2 > ii holds that for all C3 > 1, there exist subsets A„ C A 
and a constant C4 > such that 

n„(A : ||A - Aolloo < En) > cie-^^'^^'" , (4) 

n„(A^)<e-^««^", (5) 

log A^(e„, \/A„, II • II2) < C4ne^. (6) 

Then for £n = Sn'^ £n o^nd all sufficiently large M > 0, 

EAon„(A G A : II^/A - VAqHs > Me„ | iV") ^ (7) 

as n ^ CO. 



The proof of this theorem is given in Appendix A.l The assumptions of the theorem 
parallel those of similar theorems obtained earlier for other settings including density 
estimation, regression, and classification. The first condition Q, the so-called prior mass 
condition, requires that the prior puts sufficient mass near the truth. Conditions ([5|-([6]) 
together require that most of the prior mass, quantified in the sense of the remaining 



10 



mass 



condition ([s]) , is concentrated on sieves A„ which are "smah" in the sense of metric 
entropy, quantified by the entropy condition ([6]). 

The proof of the theorem shows that conditions ([5])-([6| can in fact be slightly weakened, 
at the cost of using more complicated distance measures on the intensities. The conditions 
in the theorem are more easy to work with when studying concrete priors and are expected 
to give sharp results in many cases. We note that if under the prior all intensities are 
bounded away from 0, then the set \/A^ in ([g]) may be replaced by A„. Moreover, if 
all intensities are uniformly bounded by a common constant under the prior, then the 
square-root norm ||-v/^||2 in U] may be replaced by the L^-norm || • ||2 itself. In the next 
section we verify the conditions of the theorem for the spline priors used in Section |2.1[ 

In the discrete observations case we only observe, for some m £ N and A = T/m, 
aggregated counts Cij for i = 1, . . . , n and j = 1, . . . , m, given by (IT]). As before, we 
summarize these data using the notation C" = {Cij : i = l,...,n,j = l,...,7n). As 
explained in Section 2.1 the likelihood is in that case given by ([S]), where the Aj's are 



defined as in ([2]) . Consequently, the discrete-observations posterior is given by 

n„(A £B\C" 



J^p{c^ \ x)nn{d\) 



Jp{C^\X)Un{dX) ■ 

In this case it is clear that we can not consistently identify the whole intensity function 
A from the data, but only the integrals Ai, . . . , Am- In the following theorem, which deals 
with the convergence of the posterior distribution in the case of discrete observations, we 
therefore measure the convergence using a semi-metric that identifies intensity functions 
with the same integrals over time intervals in which we make observations. For A, A' G A, 
we define the distance p by setting 



™ / / ^x 2 JIL , / fiA / fjA 2 

p\\, A') = Y, (A - a/a;) = E (\/ / ^(*) ^*-\ ^'(*) dt) ■ 

The theorem has exactly the same assumptions on the prior as Theorem [T] above, but 
gives a contraction rate relative to the distance p. 

Theorem 2 (Contraction rate for discrete observations). Assume that Aq is bounded 
away from 0. Suppose that for postive sequences e„,, e„ — ;■ such that n{en A e„)^ —)• oo as 
n -^ oo, and constants ci,C2 > zi holds that for all C3 > 1, there exist subsets A„ C A 
and a constant C4 > such that Q -([g]) hold. Then for En = £n^ £n ind all sufficiently 
large M > 0, 

Ex,Un{X G A : p(A, Aq) > Me„ | C") ^ 

as n —;■ 00. 

The proof of the theorem is given in Appendix A. 2 



3.2 Contraction rates for our spline prior 

Having the general rate of contraction results given by Theorems [l] and [2] at our disposal 
we can use them to study the performance of the spline-based prior defined in Section |2.2[ 
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We fix the order g > 2 of the sphnes that are used. As before, let A^" be the a full path 
up till time riT of an inhomogenous Poisson process N with T-period intensity Aq and let 
C" be the discrete-time counts C" = {Cij : i = I, . . . ,n,j = 1, . . . , m), with Cij as in Q. 
The contraction rate of the posterior will depend on the regularity of the true intensity 
function, measured in Holder sense. For a > 0, let C"[0,T] be the space of functions 
on [0, T] with Holder smoothness a. (For [aj the greatest integer strictly smaller than 
a, having / S C"[0, T] means that / has [aj derivatives and that the highest derivative 
j(M) ig Holder-continuous of order a — [a\.) 

Theorem 3 (Contraction rate for the spline prior). Assume the true intensity function 
Xq belongs to C"[0,T] for some a G {0,q], and Mi < Xq < M2. Consider the prior H 
constructed in Section \2.S\ For all p > 1 and all sufficiently large M > we have 



and 



EA„n„(A G A : ||A - A0II2 > m(-^) '^'" I A^^ 

V log^ n J 



E;,„n„(A G A : p(XM > Mf-^Y'^'" I C") 

V log^ n J 



as n —)• 00. 



Observe that up to a logarithmic factor, the rate of contraction in the theorem is the 
optimal rate n""'^^"'"'^"' for estimating an a-regular function. Moreover, the prior does 
not depend on a. Hence the procedure automatically adapts to the smoothness of the 
intensity function, up to the order of the splines that are used. The theorem deals with 
the case that we have known bounds M\ and Mi for the intensity. Generalisations to less 
restrictive settings are possible, see the remarks in Section |4j 

4 Concluding remarks 

In this paper we exhibit a specific spline-based prior for doing nonparametric Bayesian 
intensity smoothing for inhomogeneous Poisson processes. We show that the method is 
both practically feasible and is underpinned by theoretical performance guarantees in the 
form of adaptive rate-optimality results. 

Extensions of our results in several directions are possible. In particular, with more 
work it is possible to drop the assumption that we know an a-priori bound on the unknown 
intensity, which may be undesirable or impossible in certain situations. An obvious exten- 
sion is then to put a prior on the bound. Computationally this makes the procedure more 
demanding, but numerical investigations indicate it is still feasible. Theoretical results 
can be obtained for that more general setting as well. Among other things this involves 
an extension of Lemma [T} Having a prior on the upper bound for Aq may deteriorate 
the convergence rate however. The optimal rate will only be attained (up to lower order 
factors) if the prior on the bound has very thin tails. 

Another desirable theoretical extension would be to obtain "local" rate of convergence 
results. Our present results deal with global norms on the intensity functions. It is 
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conceivable however that convergence is faster in regions where the intensity fluctuates 
relatively little, and faster in others. More work is necessary to derive theorems that 
describe this phenomenon. 



A Proofs 

A.l Proof of Theorem [T] 

A useful observation is that we can view the statistical problem to which the theorem 
applies as a density estimation problem for functional data. Indeed, in the full observations 
case we observe a sample N^^', . . . , A/^'"), which are independent, identically distributed 
random elements in the Skorohod space D[0, T] of cadlag (right-continuous functions with 



left-hand limits) on [0,T] (see Jacod &: Shiryaev (2003), Chapter VI). Under the intensity 
function A, the density px of N'^^^ relative to the law of a standard Poisson process indexed 
by [0, T] is given by 



Px{N) 



-/(/(A(t)-l)dt+/„-' \og{\it))dNt 



(e.g. Jacod Sz Shiryaev (2003), Chapter III). Hence, the density estimation results of 



Ghosal et al. (2000), Ghosal & van der Vaart (2001) apply in our case. 



We want to apply Theorem 2.1 of Ghosal & van der Vaart (2001). This gives condi- 



tions for posterior contraction rates in terms of the Hellinger distance on densities and 
other, related distance measures. The Hellinger distance h{p\,p\i) is in our case given 
by h'^{p\,px') = 2(1 — Ea' y7)a(A0/pv(^V) ) 1 where Ea is the expectation corresponding to 
the probability measure Pa under which the process A is a Poisson process with inten- 
sity function A. The other relevant distance measures are the KuUback-Leibler divergence 
K{p\,pxi) = —Kylog{px{N)/px'{N)) between px and px' and the related variance mea- 
sure V{px,Px') = VarA' log{px{N) / py (N)) . For a Poisson process N with intensity A and 
a bounded, measurable function /, we have 

E / fit) dNt = f f{t)X{t) dt, 
Jo Jo 

' fit)X{t)dt, 



Var / fit) dNt 











Ee/o^^W'^^* = e-/6^(i-<=^P(/W))^W«'*. 



Using these relations it is straightforward to verify that we have 
h iPx,Px') = 2(1 - e 2 •'0 VV V 7 V w; )^ 
Kipx,Px') = 



A(t) 



(A(t) - A'(t)) dt + / X'it) log ^ dt, 
Jo '' 

T 

^ y\ I I. I 

dt, 




Vipx,py)= I A'(t)log2|^ 
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respectively. 

The following lemma relates these statistical distances between densities to certain 
distances between intensity functions. We denote the minimum and maximum of two 
numbers a and hhy a /\h and a V 6, respectively. 

Lemma 1. We have the inequalities 

-^(liyi-^/VllsAl) </i(pa,PA')< V2(||yr-VV||2Al), 

K{px ,PX') < 3\\y^-Vyg + Vipx ,px'), 

\\^-Vyf2<\ l\xit) V x'{t)) iog2 M dt. 

Proof. The inequalities for h follow from the fact that (1/4) (x A 1) < 1 — exp(— x/2) < xAl 
for X > 0. 

For the Kullback-Leibler divergence we have 

K{px,px')= I X'{t)f{X{t)/X'{t))dt, 
Jo 

for f{x) = x — 1 — logx. By Taylor's formula, |/(x)| is bounded by a constant times {\/x — 
1)^ in a neighborhood of 1. Since |/(x)| is bounded by |x| for x > 1 and \x\/{^/x — 1)^ — )• 1 
as X — )• cxD, we have in fact |/(x)| < 3{^/x — 1)^ for all x G (l/e,oo), say. For (0, 1/e) we 
have |/(x)| < | logx|. It follows that 



Kipx,px')<3[ (y^) - ,/y(r)f dt + [ X'{t) 

Jx/X'>l/e Jx/X'<l/e 



, A(t) 



dt. 



The first term on the right is bounded by 3|| vA — VaU^. For the second term we note that 
for A/A' < 1/e, we have log|A/A'| > 1 and hence log|A/A'| < log^ l-^/^^'l- The statement 
of the lemma follows. 

To prove the last inequality, write || V A — vA^Hi as the sum of an integral over the set 
{A' < A} and an integral over the set {A' > A} and use the fact that 1 — x < | logx| for 
xG(0,l). D 

To connect assumptions d4])-([6| to the corresponding assumptions of Theorem 2.1 of 



Ghosal & van der Vaart (2001) we first note that since Aq is bounded away from and 
infinity by assumption, the same holds for any A G A that is uniformly close enough to 
Aq. The lemma and the definition of V therefore imply that for A uniformly close enough 
to Aq, both K{px,pxq) and V{px,pxo) ^^^ bounded by a constant times the uniform norm 
||A — Aolloo- It follows that for n large enough, the Kullback-Leibler- type ball 

B{en) = {A E A : K{px,pxo) < el,V{px,PXo) < el} 

is larger than a multiple of the uniform ball {A G A : ||A — Ao||oo < En}- The lemma also 
implies that the covering number N{£n, {px '■ A G A^}, h) is bounded by N{en/V^, \/An, || • 
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II2). Hence, assumptions Q-® imply that the conditions of Theorem 2.1 of Ghosal & 
van der Vaart (2001 ) are fulfiUed. This theorem states that for M large enough, "Kx^Hni,^ '■ 
hip\,Pxo) > Msn) — >• 0. To complete the proof, note that by the fact that Msn < 1 for 
n large enough and the first inequality of the lemma, it holds, for n large enough, that 
||\/A - \/Ao||2 > \/2Me„ implies that h{px,pxo) > Men- 



A.2 Proof of Theorem H 

The proof is similar as the proof of Theorem [T| but this time we start from the observation 
that in the discrete-observations case, the data constitute a sample of n independent, 
identically distributed random vectors C^^\ . . . , C^'^> in M"^, where 

C^^' = (Q]^, . . . ,Cim) 

and Cij is given by M. The coordinates Cij of C'*' are independent Poisson variables 
with mean Xj given by ([2]). 



Again we apply Theorem 2.1 of Ghosal &: van der Vaart (2001). In this case the 



Hellinger distance /im, Kullback Leibler divergence K^ and variance measure Vm are 
easily seen to be given by 



/ii(A,A') = 2(l-e-i^(^^-^^) ), 



A' 



K^{\, A') = ^(A, - a;.) + Y, a;, log f, 



K,(A,A') = J]A;.log- 



,2^ 
A,' 



respectively. These quantities satisfy the same bounds as in Lemma [T| but with the 
integrals replaced by the corresponding sums. Moreover, by expanding the square and 
using Cauchy-Schwarz we see that 



E 



and hence also 

K^(A,A')<4J^ 



A' 



A,- A a;. 



< 



<4 



^ Il2) 



lA'l 



infi|A(t)|Ainfi|A'(i)| 



Using these relations the proof can be completed exactly as in Section A.l 



^ lb- 



A.3 Proof of Theorem [3] 

Under the prior 11, the number of knots J has, by construction, a shifted Poisson distri- 
bution. By Stirling's approximation, this implies that for large j, 



>J) 



< 



-cij] 



¥{J = j)>e-'^^'^^^ 
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for some ci,C2 > 0. For the sequence of inner knots k constructed in the definition 
of the prior we have that the mesh width M{k) = max{|A;j — A;j-i|} and the sparsity 
m{k) = min{|fcj — kj-i\} satisfy 

P(m(fc) < r"^ I J = j) = 0, P(M(fc) < 2/j \J = j)> e-^^°^K 

The first of these facts follows trivially from the construction, the second one by bounding 
the probability of interest from below by the probability that every of the consecutive 
intervals of length T/j contains at least one knot. For the B-spline coefficients we have, 
by independence, 

\e-eo\\oo<e\J = j)>e^ 



for all Oo G [Mi,M2]-'. Theorem 1 of Belitser fc Serra (2013) deals exactly with this 



situation. In the present setting the theorem asserts that if Aq € C"[0,T] and Mi < Aq < 
M2, then for Jn, Jn > Q and positive e„ > En such that e„ — )■ 0, ne^ — )• cxd and 

i/a - - 1 J^ 2 

< Jn, logJ„<log— , J„log— <ne„, ne„<J„logJ„, (8) 



then there exist function spaces (of splines) A„ and a constant c > such that 

n(A:||A-Ao||oo<2e-„)>e-^-^"^°Si^, (9) 

n(A0A„)<e-'=i"^"", (10) 

logiV(e„,A„,||-||2) <ne2. (n) 

Now observe that the first two inequalities in ([8]) hold for 

_ g _ 1 

£n = n i+2°=log^n, J„ = Kni+2a log'^n, 

provided K is large enough and q > —p/a. The third and fourth inequalities then hold 
for 

1 a 

J„ = Lni+2" log^n, e„ = n i+a^log'^n 

if L is large enough and 2p < r + 1 < 2s. To complete the proof we have to link ([9|-(11) 
to the conditions Q-ilG]) of Theorems IT] and pi Note that since ([5]) should hold for all 
C3 > 0, we need to have 

For our choices of J„ and e„ this holds if 2p > q + 1. This amounts to choosing p > 
q/(1 + 2a). Then if we define 

/Z^ 1 

£n = \ log — 

V ^ £n 

the right-hand side of ^ equals exp(— 2ne^). Moreover, it holds that e„ ~ n~°/(^"'"^")(logn)'^'^"'"^)/^, 
so if we make sure that p > {q + l)/2, the desired inequality Q holds. The considerations 
above imply that ([5| then holds as well, for any C3 > 1. Recall that we found that the 
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entropy condition holds for e„ ~ n "''^+^")(logn)'^, provided s > p. This means that we 
should choose p, q, r and s above such that 

a 1 

p>- -— , r = 2p — 1, s > p, q 



l + 2a " ' "' " l + 2a 

Since the intensities in A„ are uniformly bounded by a common constant (see the proof of 
Theorem 1 of Belitser & Serra (2013[)), (11) implies that ([6| is fulfilled. 
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